Abstract


In this work, we study distance metric learning (DML) for high dimensional data. A typical
approach for DML with high dimensional data is to perform dimensionality reduction first, before
learning the distance metric. The main shortcoming of this approach is that it may result in a
suboptimal solution due to the subspace removed by the dimensionality reduction method. In this
work, we present a dual random projection framework for DML with high dimensional data that
explicitly addresses this limitation of dimensionality reduction. The key idea is to first project all
the data points into a low dimensional space by random projection and compute the dual variables
using the projected vectors. The distance metric is then reconstructed in the original space using the
estimated dual variables. The proposed method, on the one hand, enjoys the light computation of
random projection and, on the other hand, alleviates the limitation of most dimensionality reduction
methods. We verify both empirically and theoretically the effectiveness of the proposed algorithm
for high dimensional DML.

Keywords: Distance Metric Learning, Dual Random Projection 

1. Introduction 

Distance metric learning (DML) is essential to many machine learning tasks, including ranking (Chechik
et al., 2010; Lim et al., 2013), k-nearest neighbor (k-NN) classification (Weinberger and Saul, 2009)
and k-means clustering (Xing et al., 2002). It finds a good metric by minimizing the distance
between data pairs in the same classes and maximizing the distance between data pairs from different
classes (Xing et al., 2002; Globerson and Roweis, 2005; Yang and Jin, 2006; Davis et al., 2007;
Weinberger and Saul, 2009; Shaw et al., 2011). The main computational challenge of DML arises
from the constraint that the learned matrix has to be positive semi-definite (PSD). It is
computationally demanding even with stochastic gradient descent (SGD) because it has to project the
intermediate solutions onto the PSD cone at every iteration. In a recent study (Chechik et al., 2010),
the authors show empirically that it is possible to learn a good distance metric using online learning
without having to perform the projection at each iteration. In fact, only one projection onto the PSD
cone is performed at the end of online learning to ensure that the resulting matrix is PSD.^1 Our
study of DML follows the same paradigm, which we refer to as the one-projection paradigm.
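The single projection step of the one-projection paradigm can be illustrated with a minimal sketch. It assumes the usual Frobenius-norm projection onto the PSD cone (eigendecomposition followed by clipping negative eigenvalues at zero); the function name `project_psd` is hypothetical and not taken from the paper.

```python
import numpy as np

def project_psd(M):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone."""
    M = (M + M.T) / 2.0                      # guard against numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)     # eigendecomposition of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)    # drop the negative part of the spectrum
    return (eigvecs * eigvals) @ eigvecs.T   # V diag(w_+) V^T

# Toy example: one positive and one negative eigenvalue.
M = np.array([[2.0, 0.0], [0.0, -1.0]])
P = project_psd(M)
```

The eigendecomposition costs O(d^3) for a d x d matrix, which is why performing it once at the end rather than at every iteration matters.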

Although the one-projection paradigm resolves the computational challenge from projection
onto the PSD cone, it still suffers from a high computational cost when each data point is described
by a large number of features. This is because, for d dimensional data points, the size of the learned
matrix will be $O(d^2)$, and as a result, the cost of computing the gradient, the fundamental operation for
any first order optimization method, will also be $O(d^2)$. The focus of this work is to develop an
efficient first order optimization method for high dimensional DML that avoids the $O(d^2)$ computational
cost per iteration.

Several approaches have been proposed to reduce the computation cost for high dimensional
DML. In (Davis and Dhillon, 2008), the authors assume that the learned metric M is of low rank,
and write it as $M = LL^\top$, where $L \in \mathbb{R}^{d \times r}$ with $r \ll d$. Instead of learning M, the authors
proposed to learn L directly, which reduces the cost of computing the gradient from $O(d^2)$ to
$O(dr)$. A similar idea was studied in (Weinberger and Saul, 2009). The main problem with this
approach is that it will result in non-convex optimization. An alternative approach is to reduce
the dimensionality of data using dimensionality reduction methods such as principal component
analysis (PCA) (Weinberger and Saul, 2009) or random projection (RP) (Tsagkatakis and Savakis,
2010). Although RP is computationally more efficient than PCA, it often yields significantly worse
performance than PCA unless the number of random projections is sufficiently large (Fradkin and
Madigan, 2003). We note that although RP has been successfully applied to many machine learning
tasks, e.g., classification (Rahimi and Recht, 2007), clustering (Boutsidis et al., 2010) and
regression (Maillard and Munos, 2012), only a few studies examined the application of RP to DML, and
most of them with limited success.

In this paper, we propose a dual random projection approach for high dimensional DML. Our
approach, on the one hand, enjoys the light computation of random projection and, on the other hand,
significantly improves the effectiveness of random projection. The main limitation of using
random projection for DML is that all the columns/rows of the learned metric will lie in the subspace
spanned by the random vectors. We address this limitation of random projection by

• first estimating the dual variables based on the random projected vectors and, 

• then reconstructing the distance metric using the estimated dual variables and data vectors in 
the original space. 

Since the final distance metric is computed using the original vectors, not the randomly projected
vectors, the column/row space of the learned metric will NOT be restricted to the subspace spanned
by the random projection, thus alleviating the limitation of random projection. We verify the
effectiveness of the proposed algorithms both empirically and theoretically.

We finally note that our work is built upon the recent work (Zhang et al., 2013) on random
projection where a dual random projection algorithm is developed for linear classification. Our
work differs from (Zhang et al., 2013) in that we apply the theory of dual random projection to
DML. More importantly, we have made important progress in advancing the theory of dual
random projection. Unlike the theory in (Zhang et al., 2013) where the data matrix is assumed to be
low rank or approximately low rank, our new theory of dual random projection is applicable to any

1. We note that this is different from the algorithms presented in (Hazan and Kale, 2012; Mahdavi et al., 2012). Although
these two algorithms only need either one or no projection step, they introduce additional mechanisms to prevent the
intermediate solutions from being too far away from the PSD cone, which could result in a significant overhead per
iteration.





Towards Making High Dimensional Distance Metric Learning Practical 


data matrix even when it is NOT approximately low rank. This new analysis significantly broadens 
the application domains where dual random projection is applicable, which is further verified by our 
empirical study. 


The rest of the paper is organized as follows: Section 2 introduces the methods that are related 
to the proposed method. Section 3 describes the proposed dual random projection approach for 
DML and the detailed algorithm for solving the dual problem in the subspace spanned by random 
projection. Section 4 summarizes the results of the empirical study, and Section 5 concludes this 
work with future directions. 


2. Related Work 


Many algorithms have been developed for DML (Xing et al., 2002; Globerson and Roweis, 2005;
Davis et al., 2007; Weinberger and Saul, 2009). Exemplar DML algorithms are MCML (Globerson
and Roweis, 2005), ITML (Davis et al., 2007), LMNN (Weinberger and Saul, 2009) and
OASIS (Chechik et al., 2010). Besides algorithms, several studies were devoted to analyzing the
generalization performance of DML (Jin et al., 2009; Bellet and Habrard, 2012). Survey papers (Yang and
Jin, 2006; Kulis, 2013) provide a detailed investigation of the topic. Although numerous studies
were devoted to DML, only limited progress has been made in addressing the high dimensional challenge
in DML (Davis and Dhillon, 2008; Weinberger and Saul, 2009; Qi et al., 2009; Lim et al., 2013).
In (Davis and Dhillon, 2008; Weinberger and Saul, 2009), the authors address the challenge of high
dimensionality by enforcing the distance metric to be a low rank matrix. (Qi et al., 2009; Lim et al.,
2013) alleviate the challenge of learning a distance metric M from high dimensional data by
assuming M to be a sparse matrix. The main shortcoming of these approaches is that they have to
place strong assumptions on the learned metric, significantly limiting their application. In addition,
these approaches result in non-convex optimization problems that are usually difficult to solve.
In contrast, the proposed DML algorithm does not have to make strong assumptions regarding the
learned metric.


Random projection is widely used for dimension reduction in various learning tasks (Rahimi and
Recht, 2007; Boutsidis et al., 2010; Maillard and Munos, 2012). Unfortunately, it requires a large
number of random projections for the desired result (Fradkin and Madigan, 2003), and this limits its
application in DML, where the computational cost is proportional to the square of the dimension. Dual
random projection was first introduced for the linear classification task (Zhang et al., 2013), and the
following aspects make our work significantly different from that initial study. First, we
apply dual random projection to DML, where the number of variables is quadratic in the dimension
and the dimensionality problem is more serious than for a linear classifier. Second, we optimize the dual
problem directly, rather than the primal problem in the subspace as in the previous work. Consequently,
non-smooth losses (e.g., the hinge loss) can be used with the proposed method. Last, we give a theoretical
guarantee for the case when the dataset is not low rank, whereas low rank is an important assumption in
(Zhang et al., 2013). All of these efforts aim to efficiently learn a distance metric for high dimensional
datasets, and our empirical study verifies the success of our method.
3. Dual Random Projection for Distance Metric Learning

Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ denote the collection of training examples. Given a PSD matrix M,
the distance between two examples $x_i$ and $x_j$ is given as

$$\|x_i - x_j\|_M^2 = (x_i - x_j)^\top M (x_i - x_j).$$

The proposed framework for DML will be based on triplet constraints, not pairwise constraints.
This is because several previous studies have suggested that triplet constraints are more effective
than pairwise constraints (Weinberger and Saul, 2009; Chechik et al., 2010; Shaw et al., 2011). Let
$V = \{(x_i^1, x_j^1, x_k^1), \ldots, (x_i^N, x_j^N, x_k^N)\}$ be the set of triplet constraints used for training, where $x_i^t$
is expected to be more similar to $x_j^t$ than to $x_k^t$. Our goal is to learn a metric function M that is
consistent with most of the triplet constraints in V, i.e.

$$\forall (x_i^t, x_j^t, x_k^t) \in V, \quad (x_i^t - x_j^t)^\top M (x_i^t - x_j^t) + 1 \le (x_i^t - x_k^t)^\top M (x_i^t - x_k^t).$$

Following the empirical risk minimization framework, we cast the triplet constraints based DML
into the following optimization problem:

$$\min_{M \in S_d} \; \frac{\lambda}{2}\|M\|_F^2 + \frac{1}{N}\sum_{t=1}^{N} \ell(\langle M, A_t \rangle) \tag{1}$$

where $S_d$ stands for the set of symmetric matrices of size $d \times d$, $\lambda > 0$ is the regularization parameter, $\ell(\cdot)$
is a convex loss function, $A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top$, and $\langle \cdot, \cdot \rangle$ stands for the
dot product between two matrices. We note that we did not enforce M in (1) to be PSD because we
follow the one-projection paradigm proposed in (Chechik et al., 2010) that first learns a symmetric
matrix M by solving the optimization problem in (1) and then projects the learned matrix M onto
the PSD cone. We emphasize that unlike (Zhang et al., 2013), we did not assume $\ell(\cdot)$ to be smooth,
making it possible to apply the proposed approach to the hinge loss.
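A small numerical sketch may help connect the triplet matrix $A_t$ to the constraint it encodes: the matrix dot product $\langle M, A_t \rangle$ equals the distance from $x_i$ to $x_k$ minus the distance from $x_i$ to $x_j$, so the triplet is satisfied with margin when this value is at least 1. The helper names and the hinge form $\ell(z) = \max(0, 1 - z)$ are assumptions made for illustration.

```python
import numpy as np

def triplet_matrix(x_i, x_j, x_k):
    """A_t = (x_i - x_k)(x_i - x_k)^T - (x_i - x_j)(x_i - x_j)^T."""
    u_k = x_i - x_k
    u_j = x_i - x_j
    return np.outer(u_k, u_k) - np.outer(u_j, u_j)

def hinge(z):
    """Assumed hinge loss on the margin z = <M, A_t>."""
    return max(0.0, 1.0 - z)

rng = np.random.default_rng(0)
d = 5
x_i, x_j, x_k = rng.normal(size=(3, d))
B = rng.normal(size=(d, d))
M = B @ B.T                                # an arbitrary PSD metric for the demo

A_t = triplet_matrix(x_i, x_j, x_k)
margin = np.sum(M * A_t)                   # matrix dot product <M, A_t>
dist_ij = (x_i - x_j) @ M @ (x_i - x_j)    # squared distance under M
dist_ik = (x_i - x_k) @ M @ (x_i - x_k)
loss = hinge(margin)
```

In practice $A_t$ never needs to be formed explicitly, since the margin can be computed from the two difference vectors alone.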

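Let $\ell_*(\cdot)$ be the convex conjugate of $\ell(\cdot)$. The dual problem of (1) is given by

$$\max_{\alpha \in \mathbb{R}^N} \; -\frac{1}{N}\sum_{t=1}^{N} \ell_*(\alpha_t) - \frac{1}{2\lambda N^2}\Big\|\sum_{t=1}^{N} \alpha_t A_t\Big\|_F^2$$

which is equivalent to

$$\max_{\alpha \in [-1,0]^N} \; -\sum_{t=1}^{N} \ell_*(\alpha_t) - \frac{1}{2\lambda N}\alpha^\top G \alpha \tag{2}$$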


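where $\alpha = (\alpha_1, \ldots, \alpha_N)^\top$ and $G = [G_{a,b}]_{N \times N}$ is a matrix of size $N \times N$ with $G_{a,b} = \langle A_a, A_b \rangle$. We
denote by $M_* \in \mathbb{R}^{d \times d}$ the optimal primal solution to (1), and by $\alpha_* \in \mathbb{R}^N$ the optimal dual solution
to (2). Using the first order condition for optimality, we have

$$M_* = -\frac{1}{\lambda N}\sum_{t=1}^{N} \alpha_*^t A_t \tag{3}$$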
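For readers who want the intermediate steps, the following is a sketch of how the dual and the primal-dual relation in (3) can be derived. It assumes the primal in (1) has the form $\frac{\lambda}{2}\|M\|_F^2 + \frac{1}{N}\sum_t \ell(\langle M, A_t \rangle)$ and uses the conjugate representation of a convex loss; it is a standard Fenchel duality argument, not a transcription of the authors' proof.

```latex
% Conjugate representation of the loss:
\ell(z) = \max_{\alpha} \; \alpha z - \ell_*(\alpha)

% Substituting into the assumed primal gives a saddle-point problem:
\min_{M \in S_d} \max_{\alpha \in \mathbb{R}^N} \;
  \frac{\lambda}{2}\|M\|_F^2
  + \frac{1}{N}\sum_{t=1}^{N} \big( \alpha_t \langle M, A_t \rangle - \ell_*(\alpha_t) \big)

% First order condition in M for fixed alpha:
\lambda M + \frac{1}{N}\sum_{t=1}^{N} \alpha_t A_t = 0
\quad\Longrightarrow\quad
M = -\frac{1}{\lambda N}\sum_{t=1}^{N} \alpha_t A_t

% Plugging this M back in and using
% \|\sum_t \alpha_t A_t\|_F^2 = \sum_{a,b} \alpha_a \alpha_b \langle A_a, A_b \rangle = \alpha^\top G \alpha :
\max_{\alpha} \; -\frac{1}{N}\sum_{t=1}^{N} \ell_*(\alpha_t)
  - \frac{1}{2\lambda N^2}\, \alpha^\top G \alpha
```

Multiplying the last objective by $N$ does not change the maximizer and yields the $\frac{1}{2\lambda N}\alpha^\top G \alpha$ scaling. For the hinge loss $\ell(z) = \max(0, 1 - z)$, the conjugate is $\ell_*(\alpha) = \alpha$ on $[-1, 0]$ and $+\infty$ elsewhere, which is where the box constraint on $\alpha$ comes from.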
3.1 Dual Random Projection for Distance Metric Learning

Directly solving the primal problem in (1) or the dual problem in (2) could be computationally
expensive when the data is of high dimension and the number of training triplets is very large. We address
this challenge by introducing a random matrix $R \in \mathbb{R}^{d \times m}$, where $m \ll d$ and $R_{i,j} \sim \mathcal{N}(0, 1/m)$,
and projecting all the data points into the low dimensional space using the random matrix, i.e.,
$\hat{x}_i = R^\top x_i$. As a result, $A_t$, after random projection, becomes $\hat{A}_t = R^\top A_t R$.

A typical approach of using random projection for DML is to obtain a matrix $M_s$ of size $m \times m$
by solving the primal problem with the randomly projected vectors, i.e.

$$\min_{M_s \in S_m} \; \frac{\lambda}{2}\|M_s\|_F^2 + \frac{1}{N}\sum_{t=1}^{N} \ell(\langle M_s, \hat{A}_t \rangle) \tag{4}$$

Given the learned metric $M_s$, for any two data points $x$ and $x'$, their distance is measured by
$(x - x')^\top R M_s R^\top (x - x') = (x - x')^\top M' (x - x')$, where $M' = R M_s R^\top \in \mathbb{R}^{d \times d}$ is the
effective metric in the original space $\mathbb{R}^d$. The key limitation of this random projection approach
is that both the column and row space of $M'$ are restricted to the subspace spanned by the vectors in
the random matrix R.
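The subspace restriction described above can be checked numerically. The sketch below draws a random matrix with entries from $\mathcal{N}(0, 1/m)$, builds an arbitrary symmetric $m \times m$ matrix standing in for the learned subspace metric, and confirms that the effective metric has rank at most $m$ and ignores any direction orthogonal to the columns of R. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 5

# Random projection matrix with entries drawn from N(0, 1/m).
R = rng.normal(scale=np.sqrt(1.0 / m), size=(d, m))

# Stand-in for a metric learned in the m-dimensional subspace (any symmetric matrix).
B = rng.normal(size=(m, m))
M_s = (B + B.T) / 2.0

# Effective metric in the original d-dimensional space.
M_eff = R @ M_s @ R.T
rank = np.linalg.matrix_rank(M_eff)

# A unit vector orthogonal to the column space of R (column m of the full Q factor).
Q, _ = np.linalg.qr(R, mode="complete")
v = Q[:, m]
response = M_eff @ v          # numerically zero: the metric cannot "see" v
```

Any difference between two points that lies along `v` contributes nothing to their distance under `M_eff`, which is the limitation the dual approach is designed to avoid.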

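Instead of solving the primal problem, we propose to solve the dual problem using the
randomly projected data points, i.e.

$$\max_{\hat{\alpha} \in [-1,0]^N} \; -\sum_{t=1}^{N} \ell_*(\hat{\alpha}_t) - \frac{1}{2\lambda N}\hat{\alpha}^\top \hat{G} \hat{\alpha} \tag{5}$$

where $\hat{G}_{a,b} = \langle R^\top A_a R, R^\top A_b R \rangle$. After obtaining the optimal solution $\hat{\alpha}_*$ for (5), we reconstruct
the metric by using the dual variables $\hat{\alpha}_*$ and the data matrix X in the original space, i.e.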

$$\hat{M}_* = -\frac{1}{\lambda N}\sum_{t=1}^{N} \hat{\alpha}_*^t A_t \tag{6}$$

It is important to note that unlike the random projection approach, the recovered metric $\hat{M}_*$ in (6) is
not restricted to the subspace spanned by the random vectors, a key to the success of the proposed
algorithm.
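Because every projected triplet matrix is a difference of two rank-one matrices, the entries of the projected Gram matrix in (5) reduce to squared inner products of $m$-dimensional vectors. The sketch below illustrates this under that observation; `projected_gram` and the toy data are hypothetical, and the explicit construction is included only as a cross-check.

```python
import numpy as np

def projected_gram(U_j, U_k):
    """Gram matrix of A_hat_t = u_k u_k^T - u_j u_j^T, one triplet per row."""
    KK = (U_k @ U_k.T) ** 2    # <u_k^a u_k^a^T, u_k^b u_k^b^T> = (u_k^a . u_k^b)^2
    KJ = (U_k @ U_j.T) ** 2
    JK = (U_j @ U_k.T) ** 2
    JJ = (U_j @ U_j.T) ** 2
    return KK - KJ - JK + JJ

rng = np.random.default_rng(0)
N, d, m = 6, 20, 4
X_i, X_j, X_k = rng.normal(size=(3, N, d))            # toy triplets, one per row
R = rng.normal(scale=np.sqrt(1.0 / m), size=(d, m))   # entries ~ N(0, 1/m)

U_j = (X_i - X_j) @ R      # projected differences R^T (x_i - x_j)
U_k = (X_i - X_k) @ R      # projected differences R^T (x_i - x_k)
G_hat = projected_gram(U_j, U_k)

# Cross-check against explicitly formed m x m matrices.
A_hat = [np.outer(k, k) - np.outer(j, j) for j, k in zip(U_j, U_k)]
G_explicit = np.array([[np.sum(a * b) for b in A_hat] for a in A_hat])
```

Forming the full $N \times N$ matrix is only sensible for small $N$; a coordinate-wise solver needs one row (or just the diagonal entry plus a running sum) at a time.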

Alg. 1 summarizes the key steps of the proposed dual random projection method for DML.
Following the one-projection paradigm (Chechik et al., 2010), we project the learned symmetric
matrix M onto the PSD cone at the end of the algorithm. The key component of Alg. 1 is to solve
the optimization problem in (5) at Step 4 accurately. We choose the stochastic dual coordinate ascent
(SDCA) method for solving the dual problem (5) because it enjoys linear convergence when the
loss function is smooth, and is shown empirically to be significantly faster than other stochastic
optimization methods (Shalev-Shwartz and Zhang, 2012). We use the combination strategy
recommended in (Shalev-Shwartz and Zhang, 2012), denoted by CSDCA, which uses SGD for the first
epoch and then applies SDCA for the remaining epochs.


3.2 Main Theoretical Results 

First, similar to (Zhang et al., 2013), we consider the case when the data matrix X is of low rank.
The theorem below shows that under the low rank assumption, with a high probability, the distance 
metric recovered by Algorithm 1 is nearly optimal. 
Algorithm 1 Dual Random Projection Method (DuRP) for DML
1: Input: the triplet constraints V and the number of random projections m.
2: Generate a random matrix $R \in \mathbb{R}^{d \times m}$ with $R_{i,j} \sim \mathcal{N}(0, 1/m)$.
3: Project each example as $\hat{x} = R^\top x$.
4: Solve the optimization problem (5) and obtain the optimal solution $\hat{\alpha}_*$.
5: Recover the solution in the original space by $\hat{M}_* = -\frac{1}{\lambda N}\sum_{t} \hat{\alpha}_*^t A_t$.
6: Output: $\Pi_{PSD}(\hat{M}_*)$
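The steps of Algorithm 1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the hinge loss $\ell(z) = \max(0, 1 - z)$, uses plain SDCA with a closed-form coordinate update instead of the CSDCA combination, and takes triplets as three row-aligned arrays. The function name `durp` and all parameter names are hypothetical.

```python
import numpy as np

def durp(X_i, X_j, X_k, m, lam, epochs=3, seed=0):
    """Sketch of DuRP for hinge loss; row t of X_i, X_j, X_k forms one triplet."""
    rng = np.random.default_rng(seed)
    N, d = X_i.shape
    D_j, D_k = X_i - X_j, X_i - X_k                        # original-space differences
    R = rng.normal(scale=np.sqrt(1.0 / m), size=(d, m))    # Step 2: R_ij ~ N(0, 1/m)
    U_j, U_k = D_j @ R, D_k @ R                            # Step 3: projected differences

    # Step 4: dual coordinate ascent in the projected space.
    alpha = np.zeros(N)                                    # dual variables in [-1, 0]
    M_hat = np.zeros((m, m))                               # -(1/(lam N)) sum_t alpha_t A_hat_t
    for _ in range(epochs):
        for t in rng.permutation(N):
            j, k = U_j[t], U_k[t]
            # ||A_hat_t||_F^2 for A_hat_t = k k^T - j j^T
            sq_norm = (k @ k) ** 2 - 2.0 * (k @ j) ** 2 + (j @ j) ** 2
            if sq_norm <= 1e-12:
                continue
            margin = k @ M_hat @ k - j @ M_hat @ j         # <M_hat, A_hat_t>
            # Closed-form maximizer of the dual in coordinate t, clipped to the box.
            new_alpha = np.clip(alpha[t] + lam * N * (margin - 1.0) / sq_norm, -1.0, 0.0)
            delta = new_alpha - alpha[t]
            M_hat -= (delta / (lam * N)) * (np.outer(k, k) - np.outer(j, j))
            alpha[t] = new_alpha

    # Step 5: recover the metric from the ORIGINAL difference vectors.
    M = -((D_k.T * alpha) @ D_k - (D_j.T * alpha) @ D_j) / (lam * N)

    # Step 6: single projection onto the PSD cone.
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T, alpha

# Toy run on random triplets.
rng = np.random.default_rng(1)
N, d = 200, 30
X_i, X_j, X_k = rng.normal(size=(3, N, d))
M_psd, alpha = durp(X_i, X_j, X_k, m=5, lam=1.0 / N)
```

The inner loop touches only $m$-dimensional vectors and an $m \times m$ matrix, so its per-iteration cost is independent of $d$; the $d \times d$ matrix is formed once in Step 5.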


Theorem 1 Let $M_*$ be the optimal solution to (1). Let $\hat{\alpha}_*$ be the optimal solution for (5), and let
$\hat{M}_*$ be the solution recovered from $\hat{\alpha}_*$ using (6). Under the assumption that all the data points lie
in a subspace of dimension $r$, for any $0 < \epsilon < 1/6$, with a probability at least $1 - \delta$, we have

$$\|\Pi_{PSD}(M_*) - \Pi_{PSD}(\hat{M}_*)\|_F \le \ldots$$

provided $m \ge \ldots$, where the constant $c$ is at least 1/3.

The proof of Theorem 1 can be found in the appendix. Theorem 1 indicates that if the number of
random projections is sufficiently large (i.e. $m = \Omega(r \log r)$), we can recover the optimal solution
in the original space with a small error. It is important to note that our analysis, unlike (Zhang et al.,
2013), can be applied to non-smooth losses such as the hinge loss.

In the second case, we assume the loss function $\ell(\cdot)$ is $\gamma$-smooth (i.e., $|\ell'(z) - \ell'(z')| \le \gamma |z - z'|$).
The theorem below shows that the dual variables obtained by solving the optimization problem in
(5) can be close to the optimal dual variables, even when the data matrix X is NOT low rank or
approximately low rank. For the presentation of the theorem, we first define a few important quantities.
Define matrices $M^1, M^2, M^3, M^4 \in \mathbb{R}^{N \times N}$ as


< 1 , = ll{x?-xS)||ix||(x‘-x‘)||l 

= ||(xf-x5)|gx||(x?-x;)|g 

Ki = ll{x?-xS)ll2X||(x‘-x;)||2 
A47 = ll{x?-x“)||ix||(x''-x*)||i 

Define $\kappa$ as the maximum of the spectral norms of the four matrices, i.e.

$$\kappa = \max\left(\|M^1\|_2, \|M^2\|_2, \|M^3\|_2, \|M^4\|_2\right) \tag{7}$$

where $\|\cdot\|_2$ stands for the spectral norm of matrices.


Theorem 2 Assume $\ell(z)$ is $\gamma$-smooth. Let $\alpha_*$ be the optimal solution to the dual problem in (2), and
let $\hat{\alpha}_*$ be the approximately optimal solution for (5) with suboptimality $\eta$. Then, with a probability
at least $1 - \delta$, we have

$$\|\alpha_* - \hat{\alpha}_*\|_2 \le \max\{\ldots\}$$

where $\kappa$ is defined in (7), provided $m \ge \ldots$
The proof of Theorem 2 can be found in the appendix. Unlike Theorem 1 where the data matrix X
is assumed to be low rank, Theorem 2 holds without any prior assumption about the data matrix. It
shows that despite the random projection, the dual solution can be recovered approximately using
the randomly projected vectors, provided that the number of random projections m is sufficiently
large, $\kappa$ is small, and the approximately optimal solution $\hat{\alpha}_*$ is sufficiently accurate. In the case
when most of the training examples are not linearly dependent, we could have $\kappa = \Theta(N/d)$, which
could be a modest number when d is very large. The result in Theorem 2 essentially justifies the key
idea of our approach, i.e. computing the dual variables first and recovering the distance metric later.
Finally, since $\|\alpha_* - \hat{\alpha}_*\|_2$, the approximation error in the recovered dual variables, is proportional
to the square root of the suboptimality $\eta$, an accurate solution for (5) is needed to ensure a small
approximation error. We note that given Theorem 2, it is straightforward to bound $\|M_* - \hat{M}_*\|_2$
using the relationship between the dual variables and the primal variables in (3).

4. Experiments 

We will first describe the experimental setting, and then present our empirical study for ranking and
classification tasks on various datasets.

4.1 Experimental Setting 


Table 1: Statistics for the datasets used in our empirical study. #C is the number of classes. #F is 
the number of original features. #Train and #Test represent the number of training data 
and test data, respectively. 



             #C     #F       #Train     #Test
usps         10     256      7,291      2,007
protein      3      357      17,766     6,621
caltech30    30     1,000    5,502      2,355
tdt30        30     1,000    6,575      2,819
20news       20     1,000    15,935     3,993
rcv30        30     1,000    507,585    15,195


Data sets Six datasets are used to validate the effectiveness of the proposed algorithm for DML.
Table 1 summarizes the information of these datasets. caltech30 is a subset of the Caltech256 image
dataset (Griffin et al., 2007) and we use the version pre-processed by (Chechik et al., 2010). tdt30 is
a subset of the tdt2 dataset (Cai et al., 2009). Both caltech30 and tdt30 are comprised of the examples
from the 30 most popular categories. All the other datasets are downloaded from LIBSVM (Chang
and Lin, 2011), where rcv30 is a subset of the original dataset consisting of documents from the 30
most popular categories. The datasets tdt30, 20news and rcv30 are comprised of documents
represented by vectors of ~50,000 dimensions. Since it is expensive to compute and maintain a
matrix of size 50,000 x 50,000, for these three datasets we follow the procedure in (Chechik et al.,
2010) that maps all documents to a space of 1,000 dimensions. More specifically, we first keep the
20,000 most popular words for each collection, and then reduce their dimensionality to 1,000
by using PCA. We emphasize that for several datasets in our test bed, the data matrices cannot
be well approximated by low rank matrices. Fig. 1 summarizes the eigenvalue distribution of the six
datasets used in our experiment. We observe that four of these datasets (i.e., caltech30, tdt30,
20news, rcv30) have a flat eigenvalue distribution, indicating that the associated data matrices
cannot be well approximated by a low rank matrix. This justifies the importance of removing the low
rank assumption from the theory of dual random projection, an important contribution of this work.
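One simple way to quantify how "flat" an eigenvalue distribution is, sketched here purely as an illustration, is the share of total variance captured by the leading eigenvalues of the data covariance. The function `spectrum_energy` and the two synthetic matrices are assumptions for the demo and are unrelated to the datasets in Table 1.

```python
import numpy as np

def spectrum_energy(X, top):
    """Fraction of total variance captured by the `top` largest covariance eigenvalues."""
    Xc = X - X.mean(axis=0)                    # center the rows (one example per row)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, in descending order
    eig = s ** 2                               # proportional to covariance eigenvalues
    return eig[:top].sum() / eig.sum()

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))   # exactly rank 3
flat = rng.normal(size=(200, 50))                                 # isotropic, flat spectrum

energy_low_rank = spectrum_energy(low_rank, top=3)
energy_flat = spectrum_energy(flat, top=3)
```

A value close to 1 for a small `top` indicates that a low rank approximation is adequate; a small value indicates a flat spectrum of the kind reported for four of the six datasets.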



Figure 1: The eigenvalue distribution of datasets used in our empirical study. Panels: (a) usps, (b) protein, (c) caltech30, (d) tdt30, (e) 20news, (f) rcv30; horizontal axis: #Index.


For most datasets used in this study, we use the standard training/testing split provided by the
original datasets, except for datasets tdt30, caltech30 and rcv30. For tdt30 and caltech30, we
randomly select 70% of the data for training and use the remaining 30% for testing; for rcv30, we
switch the training and test sets defined by the original package to ensure that the number of
training examples is sufficiently large.

Evaluation metrics To measure the quality of learned distance metrics, two types of evaluations
are adopted in our study. First, we follow the evaluation protocol in (Chechik et al., 2010) and
evaluate the learned metric by its ranking performance. More specifically, we treat each test instance
q as a query, and rank the other test instances in ascending order of their distance to q using the
learned metric. The mean average precision (mAP) given below is used to evaluate the quality of
the ranking list

$$\mathrm{mAP} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{r_i} \sum_{j} P(X_i^j)$$

where $|Q|$ is the size of the query set, $r_i$ is the number of relevant instances for the i-th query and $P(X_i^j)$ is
the precision for the first j ranked instances when the instance ranked at the j-th position is relevant
to the query q, so the inner sum runs over the positions j that hold a relevant instance. Here, an
instance x is relevant to a query q if they belong to the same class. Second,
we evaluate the learned metric by its classification performance with a k-nearest neighbor classifier.
More specifically, for each test instance q, we apply the learned metric to find the k training
examples with the shortest distance, and predict the class assignment for q by taking the majority
vote among the k nearest neighbors. Finally, we also evaluate the computational efficiency of the
proposed algorithm for DML by its CPU time.
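The ranking protocol described above can be sketched as follows: every instance serves once as a query, the remaining instances are sorted by their distance under the metric, and the precision is averaged over the positions that hold a relevant (same-class) instance. The function name `mean_average_precision` and the toy data are illustrative assumptions.

```python
import numpy as np

def mean_average_precision(X, y, M):
    """mAP when each row of X is used as a query against all other rows."""
    n = len(y)
    XM = X @ M
    sq = np.sum(XM * X, axis=1)                          # x^T M x for every row
    dist = sq[:, None] + sq[None, :] - 2.0 * XM @ X.T    # (x - z)^T M (x - z), M symmetric
    aps = []
    for q in range(n):
        order = np.argsort(dist[q], kind="stable")
        order = order[order != q]                        # exclude the query itself
        rel = y[order] == y[q]                           # relevance at each rank
        if not rel.any():
            continue                                     # no relevant instance for this query
        prec_at = np.cumsum(rel) / np.arange(1, n)       # precision at ranks 1..n-1
        aps.append(prec_at[rel].mean())                  # average over relevant ranks
    return float(np.mean(aps))

# Toy 1-D example with two well separated groups.
X = np.array([[0.0], [0.1], [1.0], [1.1]])
M = np.array([[1.0]])
map_separated = mean_average_precision(X, np.array([0, 0, 1, 1]), M)
map_mixed = mean_average_precision(X, np.array([0, 1, 0, 1]), M)
```

When labels follow the two groups every query ranks its only relevant instance first, so the score is 1; with interleaved labels the relevant instance falls to rank 2 or 3 and the score drops to 5/12.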

Baselines Besides the Euclidean distance that is used as a baseline similarity measure, six
state-of-the-art DML methods are compared in our empirical study:

• DuOri: This algorithm first applies Combined Stochastic Dual Coordinate Ascent (CSDCA)
(Shalev-Shwartz and Zhang, 2012) to solve the dual problem in (2) and then computes the distance
metric using the learned dual variables.

• DuRP: This is the proposed algorithm for DML (i.e. Algorithm 1).

• SRP: This algorithm applies random projection to project data into a low dimensional space,
and then it employs CSDCA to learn the distance metric in this subspace.

• SPCA: This algorithm uses PCA as the initial step to reduce the dimensionality, and then
applies CSDCA to learn the distance metric in the subspace generated by PCA.

• OASIS (Chechik et al., 2010): A state-of-the-art online learning algorithm for DML that learns
the optimal distance metric directly from the original space without any dimensionality
reduction.

• LMNN (Weinberger and Saul, 2009): A state-of-the-art batch learning algorithm for DML. It
performs the dimensionality reduction using PCA before starting DML.

Implementation details We randomly select N = 100,000 active triplets (i.e., triplets that incur a positive
hinge loss under the Euclidean distance) and set the number of epochs to be 3 for all stochastic methods
(i.e., DuOri, DuRP, SRP, SPCA and OASIS), which yields sufficiently accurate solutions in our
experiments and is also consistent with the observation in (Shalev-Shwartz and Zhang, 2012). We
search $\lambda$ over four candidate values of the form $10^{-p}$ and fix it as $1/N$ since the performance is
insensitive to it. The step size of
CSDCA is set according to the analysis in (Shalev-Shwartz and Zhang, 2012). For all stochastic
optimization methods, we follow the one-projection paradigm by projecting the learned metric onto
the PSD cone. The hinge loss is used in the implementation of the proposed algorithm. Both OASIS
and LMNN use the implementation provided by the original authors and parameters are tuned based
on the recommendation by the original authors. All methods are implemented in Matlab, except for
LMNN, whose core part is implemented in C, which is shown to be more efficient than our Matlab
implementation. All stochastic optimization methods are repeated five times and the average result
over five trials is reported. All experiments are run on a Linux server with 64GB memory
and 12 x 2.4GHz CPUs, and only a single thread is permitted for each experiment.
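The notion of an active triplet can be illustrated with a small rejection sampler: draw an anchor, a same-class example and a different-class example at random, and keep the triplet only if it incurs a positive hinge loss under the squared Euclidean distance. This is a sketch under those assumptions; the sampling scheme, the function name `sample_active_triplets` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_active_triplets(X, y, n_triplets, seed=0, max_tries=100000):
    """Return index triplets (i, j, k) with y[i] == y[j] != y[k] and positive hinge loss."""
    rng = np.random.default_rng(seed)
    n = len(y)
    triplets = []
    for _ in range(max_tries):
        if len(triplets) == n_triplets:
            break
        i = rng.integers(n)
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(y != y[i])
        if len(same) == 0 or len(diff) == 0:
            continue
        j, k = rng.choice(same), rng.choice(diff)
        d_ij = np.sum((X[i] - X[j]) ** 2)      # squared Euclidean distances
        d_ik = np.sum((X[i] - X[k]) ** 2)
        if 1.0 + d_ij - d_ik > 0.0:            # hinge loss is positive: triplet is active
            triplets.append((i, j, k))
    return np.array(triplets, dtype=int).reshape(-1, 3)

# Toy data: two overlapping classes in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.repeat([0, 1], 20)
T = sample_active_triplets(X, y, n_triplets=20)
```

Restricting training to active triplets focuses the optimization on constraints that the baseline Euclidean metric does not already satisfy with margin.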

4.2 Efficiency of the Proposed Method 

In this experiment, we set the number of random projections to be 10, which, according to the
experimental results in Sections 4.3 and 4.4, yields almost the optimal performance for the proposed algorithm.
For fair comparison, the number of reduced dimensions is also set to be 10 for LMNN.

Table 2 compares the CPU time (in minutes) of different methods. Notice that the time for
sampling triplets is not taken into account as it is incurred by all the methods, while all the other
operations (e.g., random projection and PCA) are included. It is not surprising to observe that DuRP,
SRP and SPCA have similar CPU times, and are significantly more efficient than the other methods
due to the effect of dimensionality reduction. Since DuRP and SRP share the same procedure for 
Table 2: CPU time (minutes) for different methods for DML. All algorithms are implemented in
Matlab except for LMNN whose core part is implemented in C and is more efficient than
our Matlab implementation.

             Metric in Original Space                     Metric in Subspace
             DuOri            DuRP       OASIS            SRP        SPCA       LMNN
usps         77.0±3.2         0.3±0.0    6.2±0.2          0.2±0.0    0.2±0.0    14.2
protein      214.5±5.8        0.6±0.0    81.9±3.3         0.2±0.0    0.2±0.0    488.9
caltech30    1,214.5±229.5    1.3±0.0    640.4±121.2      0.2±0.0    0.5±0.0    2,197.9
tdt30        1,029.9±16.8     0.8±0.0    140.8±4.5        0.2±0.0    0.4±0.0    624.2
20news       1,212.9±154.3    1.0±0.0    216.3±48.8       0.2±0.0    0.5±0.0    1,893.6
rcv30        1,121.3±79.4     1.3±0.0    432.5±7.7        0.2±0.0    4.2±0.0    N/A


computing the dual variables in the subspace, the only difference between them lies in the procedure
for reconstructing the distance metric from the estimated dual variables, a computational overhead
that makes DuRP slightly slower than SRP. For all datasets, we observe that DuRP is at least 200
times faster than DuOri and 20 times faster than OASIS. Compared to the stochastic optimization
methods, LMNN is the least efficient on three datasets (i.e., protein, caltech30 and 20news), mostly
due to the fact that it is a batch learning algorithm.

4.3 Evaluation by Ranking 


Table 3: Comparison of ranking results measured by mAP (%) for different metric learning
algorithms.

             Metric in Original Space                          Metric in Subspace
             Euclid    DuOri       DuRP        OASIS           SRP         SPCA        LMNN
usps         53.6      67.7±1.7    67.1±1.2    62.5±0.5        32.6±5.4    41.6±0.4    59.8
protein      39.0      47.0±0.1    49.1±0.1    45.7±0.1        37.7±0.1    41.9±0.1    41.9
caltech30    16.4      23.8±0.1    25.5±0.1    25.4±0.2        8.1±0.4     19.5±0.0    16.3
tdt30        36.8      65.9±0.2    69.4±0.3    55.9±0.1        11.2±0.3    49.7±0.2    66.4
20news       8.4       20.1±0.2    24.9±0.3    16.2±0.1        5.3±0.1     12.2±0.1    22.5
rcv30        16.7      65.7±0.1    63.2±0.2    68.6±0.1        12.8±0.4    46.5±0.0    N/A


In the first experiment, we set the number of random projections used by SRP, SPCA and the
proposed DuRP algorithm to be 10, which is roughly 1% of the dimensionality of the original space.
For fair comparison, the number of reduced dimensions for LMNN is also set to be 10. We measure
the quality of the learned metrics by their ranking performance using mAP.

Table 3 summarizes the performance of different methods for DML. First, we observe that
DuRP significantly outperforms SRP and SPCA for all datasets. In fact, SRP is worse than the
Euclidean distance, which computes the distance in the original space. SPCA is only able to perform
better than the Euclidean distance, and is outperformed by all the other DML algorithms. Second,
we observe that for all the datasets, DuRP yields similar performance as DuOri. The only difference
between DuRP and DuOri is that DuOri solves the dual problem without using random projection.
The comparison between DuRP and DuOri indicates that the random projection step has minimal
impact on the learned distance metric, justifying the design of the proposed algorithm. Third,
compared to OASIS, we observe that DuRP performs significantly better on two datasets (i.e., tdt30 and
20news) and has comparable performance on the other datasets. Finally, we observe that for
all datasets, the proposed DuRP method significantly outperforms LMNN, a state-of-the-art batch
learning algorithm for DML. We also note that because of limited memory, we are unable to run
LMNN on the dataset rcv30.

In the second experiment, we vary the number of random projections from 10 to 50. All stochastic
methods are run with five trials and Fig. 2 reports the average results with standard deviation.
Note that the performance of OASIS and DuOri remains unchanged with a varied number of
projections because they do not use projection. It is surprising to observe that DuRP almost achieves its
best performance with only 10 projections for all datasets. This is in contrast to SRP and SPCA,
whose performance usually improves with an increasing number of projections, except for the dataset
usps, where the performance of SPCA declines when the number of projections is increased
from 10 to 30. A detailed examination shows that the strange behavior of SPCA is due to its
extremely low rank at 30 projections after the learned matrix is projected onto the PSD cone. More
investigation is needed for this strange case. We also observe that DuRP outperforms DuOri for
several datasets (i.e. protein, caltech30, tdt30 and 20news). We suspect that the better performance of
DuRP is because of the implicit regularization due to the random projection. We plan to investigate
the regularization capability of random projection further in the future. We finally point out that
with a sufficiently large number of projections, SPCA is able to outperform OASIS on 3 datasets (i.e.,
protein, tdt30 and 20news), indicating that the comparison result may be sensitive to the number of
projections.
Figure 2: The comparison of different stochastic algorithms for ranking
4.4 Evaluation by Classification

In this experiment, we evaluate the learned metric by its classification accuracy with a k-NN (k = 5)
classifier. We emphasize that the purpose of this experiment is to evaluate the metrics learned by
different DML algorithms, not to demonstrate that the learned metric will result in state-of-the-art
classification performance.^2 Similar to the evaluation by ranking, all experiments are run five times
and the results averaged over five trials with standard deviation are reported in Fig. 3. We essentially
have the same observation as that for the ranking experiments reported in Section 4.3 except that
for most datasets, the three methods DuRP, DuOri, and OASIS yield very similar performance.

Note that the main concern of this paper is time efficiency; the learned metric is of size d × d. It
is straightforward to store the learned metric efficiently by keeping a low-rank approximation of it.
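
The text does not say how the low-rank approximation would be computed. As a minimal sketch, assuming a truncated eigendecomposition is acceptable, a PSD metric M can be stored as a d × k factor L with M ≈ L Lᵀ. The function name `low_rank_factor` and the toy matrix are hypothetical and used only for illustration.

```python
import numpy as np

def low_rank_factor(M, k):
    """Return L (d x k) such that L @ L.T approximates the PSD matrix M."""
    # Symmetrize to guard against round-off, then eigendecompose.
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    # eigh returns eigenvalues in ascending order; keep the k largest.
    idx = np.argsort(w)[::-1][:k]
    w_top = np.clip(w[idx], 0.0, None)  # clip tiny negative round-off values
    return V[:, idx] * np.sqrt(w_top)

rng = np.random.default_rng(0)
d, r = 50, 3
B = rng.standard_normal((d, r))
M = B @ B.T                      # toy rank-3 PSD matrix standing in for a learned metric
L = low_rank_factor(M, k=3)      # stores d*k numbers instead of d*d
err = np.linalg.norm(M - L @ L.T) / np.linalg.norm(M)
```

With the factor in hand, a squared distance (x - y)ᵀ M (x - y) can be evaluated as the squared norm of Lᵀ(x - y), which costs O(dk) rather than O(d²).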
Figure 3: The comparison of different stochastic algorithms for classification


2. Many studies (e.g., Weinberger and Saul, 2009; Xu et al., 2012) have shown that metric learning does not yield better
classification accuracy than standard classification algorithms (e.g., SVM) given a sufficiently large amount of
training data.


5. Conclusion

In this paper, we propose a dual random projection method to learn the distance metric for large-
scale high-dimensional datasets. The main idea is to solve the dual problem in the subspace spanned
by random projection, and then recover the distance metric in the original space using the estimated
dual variables. We develop the theoretical guarantee that, with high probability, the proposed
method can accurately recover the optimal solution with small error when the data matrix is of
low rank, and the optimal dual variables even when the data matrix cannot be well approximated
by a low rank matrix. Our empirical study confirms both the effectiveness and efficiency of the
proposed algorithm for DML by comparing it to the state-of-the-art algorithms for DML. In the
future, we plan to further improve the efficiency of our method by exploiting the scenario in which
the optimal distance metric can be well approximated by a low rank matrix.


Appendix A. Proof of Theorem 1 

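First, we want to prove that $\hat{G}$ is a good estimate of $G$. We rewrite $G_{a,b}$ using the Kronecker product:

$$
\begin{aligned}
G_{a,b} &= \langle A_a, A_b \rangle \\
&= \big\langle (x_i^a - x_k^a) \otimes (x_i^a - x_k^a) - (x_i^a - x_j^a) \otimes (x_i^a - x_j^a),\;
(x_i^b - x_k^b) \otimes (x_i^b - x_k^b) - (x_i^b - x_j^b) \otimes (x_i^b - x_j^b) \big\rangle \\
&= \langle z_a, z_b \rangle
\end{aligned}
$$

where $z_t = (x_i^t - x_k^t) \otimes (x_i^t - x_k^t) - (x_i^t - x_j^t) \otimes (x_i^t - x_j^t)$. Defining $Z = [z_1, \cdots, z_N]$, we have

$$
G = Z^\top Z.
$$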
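
The identity above can be checked numerically. The sketch below assumes each $A_t$ is the difference of two outer products, $A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top$, and uses random vectors as hypothetical triplets; it compares the Gram matrix built from Frobenius inner products with the one built from the Kronecker vectors $z_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 5
# Hypothetical triplets (x_i^t, x_j^t, x_k^t); row t holds the t-th triplet member.
xi, xj, xk = (rng.standard_normal((N, d)) for _ in range(3))

def A(t):
    # Assumed form of A_t: difference of two rank-one matrices.
    u, v = xi[t] - xk[t], xi[t] - xj[t]
    return np.outer(u, u) - np.outer(v, v)

def z(t):
    # Kronecker-product vector z_t; equals A(t) flattened row by row.
    u, v = xi[t] - xk[t], xi[t] - xj[t]
    return np.kron(u, u) - np.kron(v, v)

# Gram matrix from Frobenius inner products <A_a, A_b>.
G_trace = np.array([[np.sum(A(a) * A(b)) for b in range(N)] for a in range(N)])

# Gram matrix from Z = [z_1, ..., z_N], shape (d*d, N).
Z = np.stack([z(t) for t in range(N)], axis=1)
G_kron = Z.T @ Z
```

Both constructions give the same N × N matrix up to floating-point error, which is all the rewriting step claims.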

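Under the low rank assumption that all training examples lie in a subspace of dimension $r$,
the dataset $X = [x_1, \cdots, x_n]$ can be decomposed as:

$$
X = U \Sigma V^\top = \sum_{i=1}^{r} \lambda_i u_i v_i^\top
$$

where $\lambda_i$ is the $i$-th singular value of $X$, and $u_i$ and $v_i$ are the corresponding left and right singular
vectors of $X$. Given the property of the Kronecker product that $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$,
we have:

$$
\begin{aligned}
z_t &= (x_i^t - x_k^t) \otimes (x_i^t - x_k^t) - (x_i^t - x_j^t) \otimes (x_i^t - x_j^t) \\
&= [U(\bar{x}_i^t - \bar{x}_k^t)] \otimes [U(\bar{x}_i^t - \bar{x}_k^t)] - [U(\bar{x}_i^t - \bar{x}_j^t)] \otimes [U(\bar{x}_i^t - \bar{x}_j^t)] \\
&= (U \otimes U)\left[(\bar{x}_i^t - \bar{x}_k^t) \otimes (\bar{x}_i^t - \bar{x}_k^t) - (\bar{x}_i^t - \bar{x}_j^t) \otimes (\bar{x}_i^t - \bar{x}_j^t)\right]
\end{aligned}
$$

where $\bar{x}_i^t = U^\top x_i^t$. Define $\bar{Z} = (\bar{z}_1, \cdots, \bar{z}_N)$, where
$\bar{z}_t = (\bar{x}_i^t - \bar{x}_k^t) \otimes (\bar{x}_i^t - \bar{x}_k^t) - (\bar{x}_i^t - \bar{x}_j^t) \otimes (\bar{x}_i^t - \bar{x}_j^t)$; we have:

$$
G = \bar{Z}^\top (U^\top \otimes U^\top)(U \otimes U)\bar{Z} = \bar{Z}^\top (I_r \otimes I_r)\bar{Z} = \bar{Z}^\top \bar{Z}
$$

where $I_r \otimes I_r$ equals the identity operator of $\mathbb{R}^{r^2 \times r^2}$.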

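With the random projection approximation, and with $\bar{x}_i^t = U^\top x_i^t$ and $\bar{Z}$ as defined above, we have:

$$
\begin{aligned}
\hat{z}_t &= (R^\top(x_i^t - x_k^t)) \otimes (R^\top(x_i^t - x_k^t)) - (R^\top(x_i^t - x_j^t)) \otimes (R^\top(x_i^t - x_j^t)) \\
&= [R^\top U(\bar{x}_i^t - \bar{x}_k^t)] \otimes [R^\top U(\bar{x}_i^t - \bar{x}_k^t)] - [R^\top U(\bar{x}_i^t - \bar{x}_j^t)] \otimes [R^\top U(\bar{x}_i^t - \bar{x}_j^t)] \\
&= (R^\top \otimes R^\top)(U \otimes U)\left[(\bar{x}_i^t - \bar{x}_k^t) \otimes (\bar{x}_i^t - \bar{x}_k^t) - (\bar{x}_i^t - \bar{x}_j^t) \otimes (\bar{x}_i^t - \bar{x}_j^t)\right]
\end{aligned}
$$

So,

$$
\begin{aligned}
\hat{G} &= \bar{Z}^\top (U^\top \otimes U^\top)(RR^\top \otimes RR^\top)(U \otimes U)\bar{Z} \\
&= \bar{Z}^\top \left([U^\top RR^\top U] \otimes [U^\top RR^\top U]\right)\bar{Z}
\end{aligned}
$$

In order to bound the difference between $\hat{G}$ and $G$, we need the following corollary:

Corollary 3 (Zhang et al., 2013) Let $S \in \mathbb{R}^{r \times m}$ be a standard Gaussian random matrix. Then, for
any $0 < \varepsilon \le 1/2$, with a probability $1 - \delta$, we have

$$
\left\| \frac{1}{m} S S^\top - I \right\|_2 \le \varepsilon
$$

provided

$$
m \ge \frac{(r + 1)\log(2r/\delta)}{c\,\varepsilon^2},
$$

where constant $c$ is at least $1/4$.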
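
As an informal illustration of the corollary (not a proof, and not part of the original experiments), the following sketch draws a standard Gaussian matrix $S$ of size r × m and measures the spectral-norm deviation of $SS^\top/m$ from the identity for a small and a large m. The sizes and the helper name `deviation` are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 5

def deviation(m):
    # Spectral norm of (1/m) S S^T - I for a standard Gaussian S of shape (r, m).
    S = rng.standard_normal((r, m))
    return np.linalg.norm(S @ S.T / m - np.eye(r), 2)

small = deviation(50)       # few projections: noticeable deviation
large = deviation(50000)    # many projections: deviation shrinks toward zero
```

The deviation decays roughly like the square root of r/m, which is consistent with the requirement that m grow like 1/ε².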


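Define $\Delta = U^\top RR^\top U - I_r$. Using Corollary 3, with a probability $1 - \delta$, we have $\|\Delta\|_2 \le \varepsilon$.
Using the notation $\Delta$, we have the following expression for $\hat{G} - G$:

$$
\begin{aligned}
\hat{G} - G &= \bar{Z}^\top \left((I_r + \Delta) \otimes (I_r + \Delta) - I_r \otimes I_r\right)\bar{Z} \\
&= \bar{Z}^\top \left(\Delta \otimes I_r + I_r \otimes \Delta + \Delta \otimes \Delta\right)\bar{Z} = \bar{Z}^\top \Gamma \bar{Z}
\end{aligned}
$$

where $\Gamma = \Delta \otimes I_r + I_r \otimes \Delta + \Delta \otimes \Delta$. Using the fact that the eigenvalues of $A \otimes B$ are given
by $\lambda_i(A)\lambda_j(B)$, it is easy to verify that

$$
\|\Gamma\|_2 \le \|\Delta\|_2 + \|\Delta\|_2 + \|\Delta\|_2\|\Delta\|_2
$$

Using the fact that $\|\Delta\|_2 \le \varepsilon$ and taking $\varepsilon \le 1/6$, which results in $c \ge 1/3$, with a probability $1 - \delta$,
we have

$$
\|\Gamma\|_2 \le 3\varepsilon \tag{8}
$$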
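
A quick numerical sanity check of this step, under the assumption that $\Delta$ is a symmetric r × r matrix with spectral norm exactly ε (a random stand-in, not the actual $U^\top RR^\top U - I_r$): the sketch verifies the Kronecker expansion of $\Gamma$ and the resulting norm bound.

```python
import numpy as np

rng = np.random.default_rng(0)
r, eps = 4, 0.1

# Random symmetric stand-in for Delta, rescaled so that ||Delta||_2 = eps.
D = rng.standard_normal((r, r))
D = (D + D.T) / 2
D *= eps / np.linalg.norm(D, 2)

I = np.eye(r)
Gamma = np.kron(D, I) + np.kron(I, D) + np.kron(D, D)
# Direct form: (I + Delta) kron (I + Delta) - I kron I.
lhs = np.kron(I + D, I + D) - np.kron(I, I)

norm_gamma = np.linalg.norm(Gamma, 2)
```

Since ε ≤ 1, the quantity 2ε + ε² is at most 3ε, which is the form used in (8).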

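Define $L(\alpha)$ and $\hat{L}(\alpha)$ as

$$
L(\alpha) = -\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N}\alpha^\top G \alpha, \qquad
\hat{L}(\alpha) = -\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N}\alpha^\top \hat{G} \alpha
$$

We are now ready to give the proof for Theorem 1. The basic logic is straightforward. Since
$\hat{G}$ is close to $G$, we would expect $\hat{\alpha}_*$, the optimal solution to $\hat{L}(\alpha)$, to be close to $\alpha_*$, the optimal
solution to $L(\alpha)$. Since both $M_*$ and $\hat{M}_*$ are linear in the dual variables $\alpha_*$ and $\hat{\alpha}_*$, we would
expect $\hat{M}_*$ to be close to $M_*$.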

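Since $\hat{\alpha}_*$ maximizes $\hat{L}(\alpha)$ over its domain, which means $(\alpha_* - \hat{\alpha}_*)^\top \nabla \hat{L}(\hat{\alpha}_*) \le 0$, we have

$$
\hat{L}(\hat{\alpha}_*) \ge \hat{L}(\alpha_*) + \frac{1}{2\lambda N}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*) \tag{9}
$$

Using the concaveness of $\hat{L}(\alpha)$ and the fact that $\alpha_*$ maximizes $L(\alpha)$ over its domain, we have:

$$
\begin{aligned}
\hat{L}(\hat{\alpha}_*) &\le \hat{L}(\alpha_*) + (\hat{\alpha}_* - \alpha_*)^\top \nabla \hat{L}(\alpha_*) - \frac{1}{2\lambda N}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*) \\
&= \hat{L}(\alpha_*) - \frac{1}{2\lambda N}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*) + (\hat{\alpha}_* - \alpha_*)^\top \left(\nabla \hat{L}(\alpha_*) - \nabla L(\alpha_*) + \nabla L(\alpha_*)\right) \\
&\le \hat{L}(\alpha_*) + \frac{1}{\lambda N}(\hat{\alpha}_* - \alpha_*)^\top (G - \hat{G})\alpha_* - \frac{1}{2\lambda N}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*)
\end{aligned} \tag{10}
$$

Combining the inequalities in (9) and (10), we have

$$
\frac{1}{\lambda N}(\hat{\alpha}_* - \alpha_*)^\top (G - \hat{G})\alpha_* \ge \frac{1}{\lambda N}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*)
$$

or

$$
(\hat{\alpha}_* - \alpha_*)^\top (G - \hat{G})\alpha_* \ge (\hat{\alpha}_* - \alpha_*)^\top G (\hat{\alpha}_* - \alpha_*) + (\hat{\alpha}_* - \alpha_*)^\top (\hat{G} - G)(\hat{\alpha}_* - \alpha_*)
$$

Define $p_* = \bar{Z}\alpha_*$ and $\hat{p}_* = \bar{Z}\hat{\alpha}_*$; we have:

$$
(p_* - \hat{p}_*)^\top \Gamma p_* \ge \|\hat{p}_* - p_*\|_2^2 + (\hat{p}_* - p_*)^\top \Gamma (\hat{p}_* - p_*)
$$

Using the bound given in (8), with a probability $1 - \delta$, we have

$$
\|\hat{p}_* - p_*\|_2 \le \frac{3\varepsilon}{1 - 3\varepsilon}\|p_*\|_2
$$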
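
The passage from the inequality in $p_*$ and $\hat{p}_*$ to this bound can be spelled out as follows. This is a short sketch of the intermediate step, writing $\Gamma$ for the perturbation matrix bounded in (8) and using $\varepsilon \le 1/6$, so that $1 - 3\varepsilon > 0$:

$$
\begin{aligned}
(p_* - \hat{p}_*)^\top \Gamma p_* &\le \|\Gamma\|_2 \|\hat{p}_* - p_*\|_2 \|p_*\|_2 \le 3\varepsilon \|\hat{p}_* - p_*\|_2 \|p_*\|_2, \\
\|\hat{p}_* - p_*\|_2^2 + (\hat{p}_* - p_*)^\top \Gamma (\hat{p}_* - p_*) &\ge (1 - \|\Gamma\|_2)\|\hat{p}_* - p_*\|_2^2 \ge (1 - 3\varepsilon)\|\hat{p}_* - p_*\|_2^2 .
\end{aligned}
$$

Chaining the two lines gives $(1 - 3\varepsilon)\|\hat{p}_* - p_*\|_2^2 \le 3\varepsilon \|\hat{p}_* - p_*\|_2 \|p_*\|_2$, and dividing by $(1 - 3\varepsilon)\|\hat{p}_* - p_*\|_2$ (the case $\hat{p}_* = p_*$ being trivial) yields the stated bound.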


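We complete the proof by using the fact

$$
\|M_* - \hat{M}_*\|_F = \frac{1}{\lambda N}\|p_* - \hat{p}_*\|_2, \qquad \|M_*\|_F = \frac{1}{\lambda N}\|p_*\|_2
$$

and (Stark et al., 1998)

$$
\|\Pi_{PSD}(M_*) - \Pi_{PSD}(\hat{M}_*)\|_F \le \|M_* - \hat{M}_*\|_F
$$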


Appendix B. Proof of Theorem 2 

Our analysis is based on the following two theorems. 

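Theorem 4 (Theorem 2 (Blum, 2005)) Let $x \in \mathbb{R}^d$ and $\tilde{x} = R^\top x / \sqrt{m}$, where $R \in \mathbb{R}^{d \times m}$ is a
random matrix whose entries are chosen independently from $\mathcal{N}(0, 1)$. Then:

$$
\Pr\left\{(1 - \varepsilon)\|x\|_2^2 \le \|\tilde{x}\|_2^2 \le (1 + \varepsilon)\|x\|_2^2\right\} \ge 1 - 2\exp\left(-\frac{m}{4}(\varepsilon^2 - \varepsilon^3)\right).
$$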
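
The statement can be illustrated empirically. The sketch below (with arbitrary, hypothetical sizes d, m, ε and number of trials) projects one fixed vector with many independent Gaussian matrices and records how often the squared norm is preserved within a factor of 1 ± ε, alongside the theoretical lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, eps, trials = 200, 400, 0.3, 200
x = rng.standard_normal(d)

hits = 0
for _ in range(trials):
    R = rng.standard_normal((d, m))          # entries drawn i.i.d. from N(0, 1)
    x_tilde = R.T @ x / np.sqrt(m)           # projected vector, as in Theorem 4
    ratio = (x_tilde @ x_tilde) / (x @ x)    # squared-norm distortion
    hits += int((1 - eps) <= ratio <= (1 + eps))

empirical = hits / trials
# Lower bound on the success probability stated in the theorem.
bound = 1 - 2 * np.exp(-m / 4 * (eps ** 2 - eps ** 3))
```

The empirical frequency is an estimate over a finite number of trials, so it should be read as a rough check rather than as a verification of the bound.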

Theorem 5 (Lemma B-1 (Karoui, 2010)) Suppose $M$ is a real symmetric matrix with non-negative
entries, and $E$ is a real symmetric matrix such that $\max_{i,j} |E_{i,j}| \le \varepsilon$. Then, $\|E \circ M\|_2 \le \varepsilon \|M\|_2$,
where $\|\cdot\|_2$ stands for the spectral norm of a matrix and $E \circ M$ is the element-wise product between
matrices $E$ and $M$.
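
A small numerical check of this element-wise product bound, using randomly generated matrices that satisfy the hypotheses (the sizes and ε are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 30, 0.2

# Symmetric matrix with non-negative entries.
M = rng.random((n, n))
M = (M + M.T) / 2

# Symmetric matrix whose entries are bounded by eps in absolute value.
E = rng.uniform(-eps, eps, (n, n))
E = (E + E.T) / 2

lhs = np.linalg.norm(E * M, 2)       # spectral norm of the element-wise product
rhs = eps * np.linalg.norm(M, 2)
```

In the proof, $E$ plays the role of the entry-wise relative error and $M$ the role of the matrices $M^1, \ldots, M^4$.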

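Define $L(\alpha)$ and $\hat{L}(\alpha)$ as

$$
L(\alpha) = -\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N}\alpha^\top G \alpha, \qquad
\hat{L}(\alpha) = -\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N}\alpha^\top \hat{G} \alpha
$$

Since $\ell(z)$ is $\gamma$-smooth, $\ell_*(\alpha)$ is $\gamma^{-1}$-strongly convex. Using the fact that $\hat{\alpha}_*$ approximately
maximizes $\hat{L}(\alpha)$ with $\eta$-suboptimality and $\ell_*(\cdot)$ is $\gamma^{-1}$-strongly convex, we have

$$
\hat{L}(\hat{\alpha}_*) \ge \hat{L}(\alpha_*) + \frac{1}{2}(\hat{\alpha}_* - \alpha_*)^\top \left(\gamma^{-1} I + \frac{1}{\lambda N}\hat{G}\right)(\hat{\alpha}_* - \alpha_*) - \eta \tag{11}
$$

Using the concaveness of $\hat{L}(\alpha)$ and the fact that $\alpha_*$ maximizes $L(\alpha)$ over its domain, we have:

$$
\hat{L}(\hat{\alpha}_*) \le \hat{L}(\alpha_*) + \frac{1}{\lambda N}(\hat{\alpha}_* - \alpha_*)^\top (G - \hat{G})\alpha_* - \frac{1}{2}(\hat{\alpha}_* - \alpha_*)^\top \left(\gamma^{-1} I + \frac{1}{\lambda N}\hat{G}\right)(\hat{\alpha}_* - \alpha_*) \tag{12}
$$

Combining the inequalities in (11) and (12), we have

$$
\eta + (\alpha_* - \hat{\alpha}_*)^\top (\hat{G} - G)\alpha_* \ge \frac{1}{\gamma}\|\hat{\alpha}_* - \alpha_*\|_2^2
$$

when we set $\lambda = 1/N$. Using the fact $(\alpha_* - \hat{\alpha}_*)^\top (\hat{G} - G)\alpha_* \le \|\alpha_* - \hat{\alpha}_*\|_2 \|\alpha_*\|_2 \|\hat{G} - G\|_2$, we
have

$$
\|\hat{\alpha}_* - \alpha_*\|_2^2 \le \gamma \|\alpha_* - \hat{\alpha}_*\|_2 \|\alpha_*\|_2 \|\hat{G} - G\|_2 + \gamma\eta,
$$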


implying that

$$
\|\alpha_* - \hat{\alpha}_*\|_2 \le \max\left(2\gamma \|\hat{G} - G\|_2 \|\alpha_*\|_2,\; \sqrt{2\gamma\eta}\right) \tag{13}
$$
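
One way to see how a bound of this max form follows from the preceding quadratic inequality is a short case analysis. Write $x = \|\alpha_* - \hat{\alpha}_*\|_2$, $a = \gamma \|\hat{G} - G\|_2 \|\alpha_*\|_2$ and $b = \gamma\eta$, so that the inequality reads $x^2 \le a x + b$. The constants $2$ and $\sqrt{2}$ below are the ones produced by this particular argument:

$$
\begin{aligned}
\text{if } a x \ge b: &\quad x^2 \le 2 a x \;\Rightarrow\; x \le 2a, \\
\text{if } a x < b: &\quad x^2 < 2 b \;\Rightarrow\; x < \sqrt{2b},
\end{aligned}
\qquad\text{hence}\qquad x \le \max\left(2a, \sqrt{2b}\right).
$$

The first term reflects the error introduced by replacing $G$ with $\hat{G}$, and the second the error from solving the projected dual problem only to $\eta$-suboptimality.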


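To bound $\|\alpha_* - \hat{\alpha}_*\|_2$, we need to bound $\|\hat{G} - G\|_2$. To this end, we write $G_{a,b}$ as

$$
\begin{aligned}
G_{a,b} &= \left[(x_i^a - x_k^a)^\top (x_i^b - x_k^b)\right]^2 + \left[(x_i^a - x_j^a)^\top (x_i^b - x_j^b)\right]^2 \\
&\quad - \left[(x_i^a - x_k^a)^\top (x_i^b - x_j^b)\right]^2 - \left[(x_i^a - x_j^a)^\top (x_i^b - x_k^b)\right]^2
\end{aligned}
$$

Similarly, we write $\hat{G}_{a,b}$ as

$$
\begin{aligned}
\hat{G}_{a,b} &= \langle R^\top A_a R, R^\top A_b R \rangle \\
&= \left[(x_i^a - x_k^a)^\top RR^\top (x_i^b - x_k^b)\right]^2 + \left[(x_i^a - x_j^a)^\top RR^\top (x_i^b - x_j^b)\right]^2 \\
&\quad - \left[(x_i^a - x_k^a)^\top RR^\top (x_i^b - x_j^b)\right]^2 - \left[(x_i^a - x_j^a)^\top RR^\top (x_i^b - x_k^b)\right]^2
\end{aligned}
$$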
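
Both expansions can be verified numerically. The sketch below assumes $A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top$ and, as a further assumption, scales the Gaussian matrix $R$ by $1/\sqrt{m}$; the identities themselves hold for any $R$. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 3
xa = rng.standard_normal((3, d))   # rows: x_i^a, x_j^a, x_k^a
xb = rng.standard_normal((3, d))   # rows: x_i^b, x_j^b, x_k^b
R = rng.standard_normal((d, m)) / np.sqrt(m)   # scaling is an assumption

def diffs(x):
    # Returns (x_i - x_k, x_i - x_j) for one triplet.
    return x[0] - x[2], x[0] - x[1]

def A(x):
    u, v = diffs(x)
    return np.outer(u, u) - np.outer(v, v)

def expanded(P):
    # Sum of four squared bilinear forms s^T P t, with the signs used in the text.
    (ua, va), (ub, vb) = diffs(xa), diffs(xb)
    q = lambda s, t: (s @ P @ t) ** 2
    return q(ua, ub) + q(va, vb) - q(ua, vb) - q(va, ub)

G_ab = np.sum(A(xa) * A(xb))                                  # <A_a, A_b>
G_ab_hat = np.sum((R.T @ A(xa) @ R) * (R.T @ A(xb) @ R))      # <R^T A_a R, R^T A_b R>
```

Taking `P` as the identity recovers the expansion of $G_{a,b}$, and taking `P = R @ R.T` recovers that of $\hat{G}_{a,b}$, so the entry-wise difference splits into four terms of the same shape.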

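Hence, we can write $\hat{G} - G = B^1 + B^2 + B^3 + B^4$, where $B^1$, $B^2$, $B^3$, and $B^4$ are defined
as

$$
\begin{aligned}
B^1_{a,b} &= \left[(x_i^a - x_k^a)^\top RR^\top (x_i^b - x_k^b)\right]^2 - \left[(x_i^a - x_k^a)^\top (x_i^b - x_k^b)\right]^2 \\
B^2_{a,b} &= \left[(x_i^a - x_j^a)^\top RR^\top (x_i^b - x_j^b)\right]^2 - \left[(x_i^a - x_j^a)^\top (x_i^b - x_j^b)\right]^2 \\
B^3_{a,b} &= \left[(x_i^a - x_k^a)^\top (x_i^b - x_j^b)\right]^2 - \left[(x_i^a - x_k^a)^\top RR^\top (x_i^b - x_j^b)\right]^2 \\
B^4_{a,b} &= \left[(x_i^a - x_j^a)^\top (x_i^b - x_k^b)\right]^2 - \left[(x_i^a - x_j^a)^\top RR^\top (x_i^b - x_k^b)\right]^2
\end{aligned}
$$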
Using the result from Theorem 4 and the definition of matrices $M^1$, $M^2$, $M^3$, $M^4$, we have, with a
probability $1 - \delta$, for any $a, b$,

$$
|B^i_{a,b}| \le \varepsilon M^i_{a,b}, \qquad i \in \{1, 2, 3, 4\}
$$

provided that $\varepsilon \le 1/2$ and

$$
m \ge \text{[lower bound missing in the source]} \tag{14}
$$

Using Theorem 5, under the condition in (14), we have, with a probability $1 - \delta$,

$$
\|\hat{G} - G\|_2 \le \varepsilon \left(\|M^1\|_2 + \|M^2\|_2 + \|M^3\|_2 + \|M^4\|_2\right) \le 4\kappa\varepsilon
$$

where the last step uses the definition of $\kappa$. We complete the proof by plugging the bound for
$\|\hat{G} - G\|_2$ into (13).